# Load packages and import data
library(ggplot2)
library(dplyr)
library(readr)
library(stringr)
library(tidyr)
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv")
# A little bit of data wrangling code - let's not focus on this for now
campaigns <- read_csv("https://mac-stat.github.io/data/campaign_spending.csv") %>%
  dplyr::select(wholename, district, votes, incumbent, spending) %>%
  mutate(spending = spending / 1000) %>%
  filter(!is.na(spending))
# A little bit of data wrangling code - let's not focus on this for now
cars <- read_csv("https://mac-stat.github.io/data/used_cars.csv") %>%
  mutate(milage = milage %>% str_replace(",","") %>% str_replace(" mi.","") %>% as.numeric(),
         price = price %>% str_replace(",","") %>% str_replace("\\$","") %>% as.numeric(),
         age = 2025 - model_year) # 2025 so that yr. 2024 cars are one year old
Multiple linear regression: interaction terms practice (Notes)
STAT 155
Notes
Learning goals
By the end of this lesson, you should be able to:
- Visualize interactions between categorical and quantitative predictors using scatterplots and side-by-side boxplots
- Think critically about whether an interaction term makes sense and should be included in a multiple linear regression model
- Write a model formula for a multiple linear regression model with an interaction term between two quantitative predictors, two categorical predictors, or one quantitative and one categorical predictor
- Interpret the intercept and slope coefficients in a multiple linear regression model with an interaction term
Readings and videos
Choose either the reading or the videos to go through before class.
- Reading: Section 3.9.3 in the STAT 155 Notes
- Video:
File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.
Exercises
Context: Today we’ll explore data on incumbency and campaign spending, revisit the bikes data we’ve looked at previously in this course, and explore data on characteristics of used cars. The code at the top of this file reads in the data.
For the first several exercises, we’ll consider the following research questions:
What role does campaign spending play in elections?
- Do candidates that spend more money tend to get more votes?
- How might this depend upon whether a candidate is an incumbent (they are running for RE-election) or a challenger (they are challenging the incumbent)?
We’ll use data collected by Benoit and Marsh (2008) on the campaign spending of 464 candidates in the 2002 Irish Dáil elections (the Dáil is Ireland’s version of the U.S. House of Representatives) to explore these questions. Spending is measured in thousands of euros.
Exercise 1: Translating scientific questions into statistical questions
- Look at the variables we have access to in the cleaned version of the data we read into R, and consider our first research question. How might we translate this question into a statistical one, that we could answer using the data we have available?
There is no one right answer to this! Brainstorm with your group.
head(campaigns)
## # A tibble: 6 × 5
##   wholename         district             votes incumbent spending
##   <chr>             <chr>                <dbl> <chr>        <dbl>
## 1 Aengus O Snodaigh Dublin South Central  5591 No          28.9
## 2 Aidan McMahon     Louth                  294 No           0.557
## 3 Aidan Ryan        Limerick East           19 No           2.24
## 4 Aine Ni Chonaill  Dublin South Central   926 No           4.08
## 5 Alan Dukes        Kildare South         4967 Yes         12.1
## 6 Alan Shatter      Dublin South          5363 Yes         11.9
- Question 2 (a) is a bit more specific than Question 1. Translate this question into a statistical one that can be answered using a simple linear regression model. Write out the model statement in \(E[Y | X] = ...\) notation that would answer this question, and note which regression coefficient you would interpret to provide you with an answer.
\[ E[___ | ___] = ... \]
- Question 2 (b) is also specific, and builds on Question 2 (a). Translate this question into a statistical one that can be answered using a multiple linear regression model. Write out the model statement in \(E[Y | X] = ...\) notation that would answer this question, and note which regression coefficient you would interpret to provide you with an answer.
\[ E[___ | ___] = ... \]
Exercise 2: Visualizing Interaction
- Write R code to visualize the relationship between campaign spending and number of votes a candidate received. Include an aesthetic to distinguish this relationship between incumbents and challengers. Do not include lines of best fit from any statistical model on your plot at this point!
# Visualization
- Based on your visualization from part (a), what are your answers to research questions 2 (a) and 2 (b)? Write your answer in 2-3 sentences, describing general trends you notice, suitable for a general audience.
- Add lines of best fit from a statistical model that includes an interaction term between incumbent status and spending to your plot from part (a), using geom_smooth. Based on your updated plot, do you think including an interaction between incumbent status and spending in a multiple linear regression model would be meaningful in this context? Why or why not?
# Visualization with lines of best fit
Exercise 3: Fitting and interpreting models with interaction terms
- Fit the regression model you wrote out in Exercise 1 (c). Report (do not interpret yet!) the regression coefficients below.
# Model with interaction term
(Intercept):
incumbentYes:
spending:
incumbentYes:spending:
- Using the coefficient estimates from part (a), write out two separate model statements, one for incumbents and one for challengers. Combine terms (using algebra) when you can! Hint: remember the indicator variables video!
- For incumbents:
\[ E[votes | spending] = \]
- For challengers:
\[ E[votes | spending] = \]
- Interpret the coefficient for incumbent in your interaction model, in context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases. Is this coefficient scientifically meaningful?
When interpreting an interaction coefficient where one of the variables interacting is quantitative and one is categorical, it is often convenient to do so in separate sentences: interpret the slope for each category separately!
- Interpret the coefficient for the interaction term in your model, in context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases.
- Based on your interpretation in part (d), and the visualization you made including lines of best fit, do you think that including an interaction term for incumbent status and spending is meaningful, when predicting number of votes? Explain why or why not.
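As a symbolic sketch of the algebra behind part (b), not the fitted answer: suppose the interaction model has the four coefficients reported in part (a), written here as \(\beta_0\) for (Intercept), \(\beta_1\) for incumbentYes, \(\beta_2\) for spending, and \(\beta_3\) for incumbentYes:spending. This labeling is an assumption for illustration, and your own notation from Exercise 1 may differ. Setting the indicator incumbentYes to 0 or 1 and collecting terms gives one line per group:
\[ E[votes | spending, incumbentYes] = \beta_0 + \beta_1 \, incumbentYes + \beta_2 \, spending + \beta_3 \, incumbentYes \cdot spending \]
\[ \text{Challengers (incumbentYes = 0):} \quad E[votes | spending] = \beta_0 + \beta_2 \, spending \]
\[ \text{Incumbents (incumbentYes = 1):} \quad E[votes | spending] = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) \, spending \]
Under this labeling, \(\beta_1\) is the difference in intercepts between the two groups and \(\beta_3\) is the difference in slopes.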
Exercise 4: Interactions between two categorical variables
Let’s return to our data on bike ridership. Suppose we are interested in the relationship between daily ridership (our response variable) and whether a user is a casual or registered rider and whether the day falls on a weekend. First, we need to create a binary variable indicating whether a user is a casual or registered rider.
# Creating user variable, don't worry about syntax!
new_bikes <- bikes %>%
  dplyr::select(riders_casual, riders_registered, weekend, temp_actual) %>%
  pivot_longer(cols = riders_casual:riders_registered, names_to = "user",
               names_prefix = "riders_", values_to = "rides") %>%
  mutate(weekend = factor(weekend))
- For each of our three relevant variables, weekend, user, and rides, classify them as quantitative or categorical.
weekend:
user:
rides:
- Make an appropriate visualization to explore the relationship between these three variables.
# Visualization
- Is the relationship between ridership and weekend status the same for both registered and casual users? Explain why or why not, referencing the visualization you made in part (b).
- To reflect what you observed in your visualization, fit a multiple linear regression model with an interaction term between weekend and user in our model of ridership.
# Multiple linear regression model
- Interpret the interaction term from your model, in context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases. Just as in Exercise 3, you may find it useful to first write out multiple model statements for different categories defined by one of your categorical variables, and proceed from there!
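A symbolic sketch of the "multiple model statements" idea for two categorical predictors may help here. The indicators \(W\) (1 on weekends, 0 otherwise) and \(R\) (1 for registered users, 0 for casual users) are hypothetical labels chosen for illustration. Which category R treats as the reference level in your fitted model is an assumption you should check against your own output.
\[ E[rides | W, R] = \beta_0 + \beta_1 W + \beta_2 R + \beta_3 W \cdot R \]
\[ \begin{array}{lll} & \text{casual } (R = 0) & \text{registered } (R = 1) \\ \text{weekday } (W = 0) & \beta_0 & \beta_0 + \beta_2 \\ \text{weekend } (W = 1) & \beta_0 + \beta_1 & \beta_0 + \beta_1 + \beta_2 + \beta_3 \end{array} \]
\[ \beta_3 = \big[ (\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) \big] - \big[ (\beta_0 + \beta_1) - \beta_0 \big] \]
Under these assumptions, \(\beta_3\) is a difference of differences: how much the weekend-versus-weekday gap in average rides for registered users differs from the same gap for casual users.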
Exercise 5: Interactions between two quantitative variables
Here we’ll explore the relationship between price, mileage, and age of a used car. Below is a scatterplot of price vs. mileage, colored by age:
cars %>%
  ggplot(aes(x = milage, y = price, col = age)) +
  geom_point(alpha = 0.5) + # make the points less opaque
  scale_color_viridis_c(option = "H") + # a fun, colorblind-friendly palette!
  theme_classic() # removes the gray background and grid
It’s a little difficult to tell what exactly is going on here. In particular, does the relationship between mileage and price vary with age of a used car? Let’s try adding some fitted lines for cars of different ages.
# Ignore where the numbers in geom_abline() came from for now... we'll get there
cars %>%
  ggplot(aes(x = milage, y = price, col = age)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c(option = "H") +
  theme_classic() +
  geom_abline(slope = -6.558e-01 + 2.431e-02, intercept = 9.096e+04 - 2.665e+03, col = "black") +
  geom_abline(slope = -6.558e-01 + 10 * 2.431e-02, intercept = 9.096e+04 - 10 * 2.665e+03, col = "blue") +
  geom_abline(slope = -6.558e-01 + 30 * 2.431e-02, intercept = 9.096e+04 - 30 * 2.665e+03, col = "green") +
  ggtitle("Black: Age = 1yr, Blue: Age = 10yr, Green: Age = 30yr")
- Challenge question: Based on the fitted lines in the plot above, anticipate what the signs (positive or negative) of the coefficients in the following interaction model will be:
\[ E[price | age, milage] = \beta_0 + \beta_1 milage + \beta_2 age + \beta_3 milage:age \]
- \(\beta_0\): Put your response here…
- \(\beta_1\): Put your response here…
- \(\beta_2\): Put your response here…
- \(\beta_3\): Put your response here…
- Fit a multiple linear regression model with an interaction term between milage and age in our model of used car price.
# Multiple linear regression model
# ... now do you see where the numbers in geom_abline() came from?
As before, we could choose distinct ages and interpret the relationship between mileage and price for each of those groups separately. However, since age is quantitative and not categorical, this doesn’t quite give us the whole picture. Instead, we want to know how the relationship between mileage and price changes for each additional year of a car’s age. This is what the interaction coefficient estimates when the interaction term is between two quantitative variables!
- Interpret the interaction term, in context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases.
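One way to see what the interaction coefficient in Exercise 5 measures is to regroup the model by powers of milage, so that both the intercept and the mileage slope become functions of age. This is a sketch: the numeric check assumes the constants in the geom_abline() calls above are the rounded coefficient estimates from this model, which the code comments suggest but do not state outright.
\[ E[price | age, milage] = \beta_0 + \beta_1 milage + \beta_2 age + \beta_3 \, milage \cdot age = (\beta_0 + \beta_2 \, age) + (\beta_1 + \beta_3 \, age) \, milage \]
\[ \text{At } age = 10: \quad \text{slope} = -0.6558 + 10(0.02431) = -0.4127, \qquad \text{intercept} = 90960 - 10(2665) = 64310 \]
In this form, the mileage slope \(\beta_1 + \beta_3 \, age\) changes by \(\beta_3\) for each additional year of age. That change is the quantity the interaction coefficient estimates.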
Reflection
Through the exercises above, you practiced visualizing, fitting, and interpreting multiple linear regression models with interaction terms between combinations of categorical and quantitative variables. Think about how the fitted lines looked in situations where you think there was a meaningful interaction taking place. How do you think the fitted lines would look if there was no meaningful interaction present? Explain your reasoning.
Response: Put your response here.
Done!
- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer – check that your work translated correctly; and (3) Outside RStudio, navigate to your ‘Activities’ subfolder within your ‘STAT155’ folder and locate the HTML file – you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the ‘Stop’ button in the ‘Background Jobs’ pane.
- Check the solutions on the course website, at the bottom of the corresponding chapter.
- Work on homework!